Introduction to Open Data Science - Course Project

About the project

This is my course diary for the ‘Open Data Science’ course. I am familiar with R and RStudio. I have used Git before, but I am not confident enough to use it independently. I am interested in learning about data science and how statistics can be conveyed to non-statisticians. I was searching for a course not necessarily in statistics, but one that covers data science applicable to any field, along with some simple ways of learning and using GitHub. The reviews from earlier students encouraged me to enroll in this course. My GitHub repository for this course: https://github.com/Ashwini-Helsinki/IODS-project


Regression and model validation

First thing is to load required libraries or R packages for this exercise.

# Required libraries
library(GGally)
library(ggplot2)

Read the data

The data created in the data wrangling exercise is read as ‘data2’ here. The data covers 166 students; each row describes one student with 7 columns of information: age, gender, the student’s learning approaches (deep, surface and strategic learning) and attitude towards statistics. The column ‘Points’ gives each student’s exam points.

setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data2 <- read.csv("data/learning2014.csv",header = TRUE)

# Dimension of data: rows and columns respectively.
dim(data2)
## [1] 166   7
# Structure of data
str(data2)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

Graphical overview

A summary of each variable in the analysis data is shown below. For each variable, the minimum, maximum, 1st quartile, median, mean and 3rd quartile are shown in the summary.

summary(data2)
##     gender               Age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Let’s examine in detail how each of the variables in the data is distributed and how the variables relate to each other.

# create graphical overview of variables with ggpairs()
p <- ggpairs(data2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The graph above shows the distribution of each variable in the data and its correlation with the other variables. This is a plot matrix with 7 rows and 7 columns; each row and each column represents one variable from the data.

For all variables except ‘gender’, a density plot is shown on the diagonal. Below the diagonal, each panel shows a scatter plot of the joint observations of the row and column variables. The corresponding panel above the diagonal, in the same row, shows the correlation between the row and column variables.

Since the gender variable has only two values, ‘F’ and ‘M’, the plot shows a histogram for gender. Its relation to the other variables is shown by histograms of those variables for each gender (in the first column) and by box plots per gender (in the first row).

From the graph, it can be seen that there are more than 100 female students and around 50 male students. The density plot of ‘Age’ shows that most students are under 30; the long right tail indicates a few students between 30 and 60 years of age. The density of ‘attitude’ looks roughly normal, spanning 1 to 5 with a mode around 3. The ‘deep’ learning approach has a long left tail. Strategic learning (‘stra’) and surface learning (‘surf’) both show slightly bimodal densities. ‘Points’ has a mode around 22 and a small number of observations near 11.

The correlations show that attitude towards statistics and exam points are positively correlated: the better the attitude, the higher the exam points. Surface learning is negatively correlated with all other variables.
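The correlations in the ggpairs panel can also be printed as a plain matrix. A quick sketch, assuming ‘data2’ is loaded as above (gender, the only non-numeric column, is dropped):

```r
# correlation matrix of the numeric variables (gender excluded)
round(cor(data2[, c("Age", "attitude", "deep", "stra", "surf", "Points")]), 2)
```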

How these variables differ between the two ‘gender’ groups is shown in the following graph. The transparency of the overlapping plots is set by alpha.

# create graphical overview of variables with ggpairs()
p1 <- ggpairs(data2, mapping = aes(col= gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p1

Linear regression

Let’s fit a linear regression model to study the effect of the three covariates attitude, deep and stra on exam points.

my_model1 <- lm(Points ~ attitude + deep + stra, data= data2)

summary(my_model1)
## 
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

The ‘t value’ in the summary table is the test statistic for the statistical test of the parameter (coefficient) of each covariate (explanatory variable). It tests whether the coefficient is zero or not. If the coefficient is away from zero, the corresponding explanatory variable has an impact on the response (target) variable. If the standardized test statistic (t value) is large in absolute value (positive or negative), the corresponding p-value is small, indicating statistical significance of the covariate.
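As a sanity check (my own arithmetic, using the numbers printed in the summary above), the t value can be reproduced by dividing the estimate by its standard error, and the p-value follows from the t distribution with 162 degrees of freedom:

```r
# t value for 'attitude' = estimate / standard error
t_attitude <- 3.5254 / 0.5683
t_attitude                              # approx 6.20, as in the summary
# two-sided p-value from the t distribution with 162 df
2 * pt(-abs(t_attitude), df = 162)      # very small, matching the summary
```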

The model summary shows that ‘attitude’ and ‘stra’ have a statistically significant relationship with the target variable ‘Points’, since their p-values are small. The p-value for the covariate ‘attitude’ is less than 0.001; the estimate of its coefficient is 3.5254 with standard error 0.5683. An increase of 1 unit in attitude increases the expected exam points by 3.5254 units.

The p-value for ‘stra’ is less than 0.1. Its coefficient estimate of 0.9621 indicates a 0.9621-unit increase in exam points when ‘stra’ increases by 1 unit.

The variable ‘deep’ does not show a significant relationship with ‘Points’ (p-value > 0.3), hence it is removed from the model.

The intercept term has a large estimate of 11.3915 with a very small p-value, i.e. the baseline level of exam points is clearly non-zero. This may hint that some relevant covariates are not included in the model.

However, no new covariate will be added at this stage. As per the instructions in the exercise, we will fit the new model by removing ‘deep’.

my_model2 <- lm(Points ~ attitude + stra , data= data2)

summary(my_model2)
## 
## Call:
## lm(formula = Points ~ attitude + stra, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The summary of the new model is shown above. The estimated effects of ‘attitude’ and ‘stra’ have not changed much compared to the previous model. The model can be written as

Points = 8.9729 + (3.4658 * attitude) + (0.9137 * stra) + (Error term)

The error term is assumed to be normally distributed. Multiple R-squared is 0.2048, meaning the model explains only about 20% of the variation in the response variable ‘Points’.
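To make the equation concrete, here is a small sketch plugging in values for a hypothetical student (attitude = 4 and stra = 3 are my own example values, not from the data):

```r
# predicted exam points by hand for attitude = 4, stra = 3
8.9729 + 3.4658 * 4 + 0.9137 * 3        # = 25.5772
# the same prediction via the fitted model object:
# predict(my_model2, newdata = data.frame(attitude = 4, stra = 3))
```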

To study model fitting in more detail, let’s examine various plots of the models.

par(mfrow = c(1,3))
plot(my_model2, which=c(1,2,5))

The Residuals vs Fitted plot shows no specific pattern that would indicate a relationship between the fitted values and the residuals. The red line deviates a bit from the horizontal, but the deviation is not large. Hence the size of the errors does not appear to depend on the explanatory variables, as assumed in linear regression.

A second assumption in linear regression is that the error term is normally distributed. The Normal Q-Q plot shows that most observations lie close to the straight line, but some (numbers 35, 56, 145) and some observations at the top deviate a bit from it. Points in the middle region support the normality assumption for the error term.

From the Residuals vs Leverage plot, observations 35, 71 and 145 have extreme leverage compared to the other points. Still, none of them fall outside Cook’s distance, so no single observation has an outsized impact on the model.
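As a numeric complement to the Q-Q plot (not part of the exercise, just an extra check), normality of the residuals could also be assessed with the Shapiro-Wilk test, assuming ‘my_model2’ is fitted as above:

```r
# Shapiro-Wilk test of residual normality; a small p-value would
# indicate departure from the normality assumption
shapiro.test(residuals(my_model2))
```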


Logistic Regression

First thing is to load required libraries or R packages for this exercise.

# Required libraries
library(dplyr)
library(boot)

Read the data

The data created in the data wrangling exercise is read as ‘data3’ here. It concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, collected from school reports and questionnaires. The dataset includes performance in two distinct subjects: Mathematics (mat) and Portuguese language (por); the ‘.math’ and ‘.por’ suffixes show grades in the Maths and Portuguese language courses, and the unsuffixed G1, G2 and G3 are the student’s average grades over the two subjects. G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.

setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data3 <- read.csv("data/Alcdata.csv",header = TRUE)

# Dimension of data: rows and columns respectively.
dim(data3)
## [1] 370  47
# variable names of data
colnames(data3)
##  [1] "school"        "sex"           "age"           "address"      
##  [5] "famsize"       "Pstatus"       "Medu"          "Fedu"         
##  [9] "Mjob"          "Fjob"          "reason"        "guardian"     
## [13] "traveltime"    "studytime"     "failures.math" "schoolsup"    
## [17] "famsup"        "paid.math"     "activities"    "nursery"      
## [21] "higher"        "internet"      "romantic"      "famrel"       
## [25] "freetime"      "goout"         "Dalc"          "Walc"         
## [29] "health"        "absences.math" "G1.math"       "G2.math"      
## [33] "G3.math"       "failures.por"  "paid.por"      "absences.por" 
## [37] "G1.por"        "G2.por"        "G3.por"        "failures"     
## [41] "paid"          "absences"      "G1"            "G2"           
## [45] "G3"            "alc_use"       "high_use"

This data is used in this analysis to study the relationships between high/low alcohol consumption and some of the other variables in the data.

Hypothesis

Let’s choose four variables namely ‘sex’, ‘famrel’, ‘absences’ and ‘failures’. My hypotheses about relationships of these variables with high alcohol consumption are as follows:
1. ‘sex’ may not have a direct relationship with high alcohol consumption.
2. ‘famrel’, i.e. good family relationship, has a negative impact on high alcohol consumption: the better the family relations, the less the alcohol consumption. This is a categorical variable.
3. ‘absences’ and ‘failures’ have a positive impact on high alcohol consumption: the more absences, the more alcohol consumption.

Numerical and graphical exploration

Let’s observe the relationship of these 4 variables with high alcohol consumption.

  1. ‘sex’ is a binary variable and high alcohol consumption (‘high_use’) is also binary, so let’s use a 2x2 table and a bar graph to look at their relationship.
# 2x2table
table(data3$sex, data3$high_use)
##    
##     FALSE TRUE
##   F   154   41
##   M   105   70
# bar graph
ggplot(data3, aes(x=sex, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)))+ 
          scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies")

Both the table and the graph show that a higher percentage of the male group ‘M’ than the female group ‘F’ shows high alcohol consumption. Hence my hypothesis about the relationship between ‘sex’ and high/low alcohol consumption is not true.

  2. Family relationship

Let’s look at the tabular and graphical representation.

# tabular representation
table(data3$famrel, data3$high_use)
##    
##     FALSE TRUE
##   1     6    2
##   2     9    9
##   3    39   25
##   4   128   52
##   5    77   23
# bar graph
ggplot(data3, aes(x=famrel, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)), position=position_dodge())+ 
          scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies")

Both the table and the graph show that for famrel categories 1 to 3 the relative frequency of high alcohol consumption is substantial, whereas categories 4 and 5, which indicate better family relationships, have a lower percentage of students with high alcohol consumption. Here my hypothesis is somewhat true.

  3. Absences
# tabular representation
table(data3$absences , data3$high_use)
##     
##      FALSE TRUE
##   0     50   13
##   1     37   13
##   2     40   16
##   3     31    7
##   4     24   11
##   5     16    6
##   6     16    5
##   7      9    3
##   8     14    6
##   9      5    6
##   10     5    2
##   11     2    3
##   12     3    4
##   13     1    1
##   14     1    6
##   16     0    1
##   17     0    1
##   18     1    1
##   19     0    1
##   20     2    0
##   21     1    1
##   26     0    1
##   27     0    1
##   29     0    1
##   44     0    1
##   45     1    0
# box plot
ggplot(data3, aes(x=high_use , y=absences))+ geom_boxplot()

The box plot shows that students with high alcohol consumption have more absences. The ranges of the two boxes are similar, but in the TRUE group the upper whisker extends much further and the 3rd quartile lies well above the median. This indicates that absences and high alcohol consumption are positively correlated. My hypothesis is true to an extent.

  4. Failures
# tabular representation
table(data3$failures , data3$high_use)
##    
##     FALSE TRUE
##   0   238   87
##   1    12   12
##   2     8    9
##   3     1    3
# bar graph
ggplot(data3, aes(x=failures , fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)),position=position_dodge())+ 
          scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies")

Here high alcohol consumption increases with the number of failures. My hypothesis is true.

Logistic regression model

Let’s fit a logistic regression model to study the effect of the four covariates sex, famrel, absences and failures on high/low alcohol consumption. Although famrel and failures could be treated as categorical variables, the data is not evenly distributed over their categories, so choosing a base category is difficult; let’s treat them as continuous variables instead. ‘sex’ is a binary variable, so one of its categories is treated as the base category and the other is interpreted relative to it.
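For reference, a sketch of the alternative, categorical treatment of famrel (not fitted here; the model name is my own, and the code assumes ‘data3’ is loaded as above, with famrel level 1 as the base category):

```r
# alternative model treating famrel as an unordered factor;
# each level 2-5 then gets its own coefficient relative to level 1
my_model_famrel_factor <- glm(high_use ~ sex + factor(famrel) + absences + failures,
                              data = data3, family = binomial(link = "logit"))
```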

# Fit logistic regression
my_model1 <- glm(high_use ~ sex +  famrel + absences + failures  , data= data3, family =binomial(link = "logit"))


# summary of the model
summary(my_model1)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + absences + failures, 
##     family = binomial(link = "logit"), data = data3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0786  -0.8216  -0.5746   0.9760   2.1820  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.79406    0.54281  -1.463  0.14350    
## sexM         1.04800    0.25091   4.177 2.96e-05 ***
## famrel      -0.29791    0.13044  -2.284  0.02238 *  
## absences     0.08941    0.02274   3.932 8.43e-05 ***
## failures     0.57328    0.20531   2.792  0.00523 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 401.77  on 365  degrees of freedom
## AIC: 411.77
## 
## Number of Fisher Scoring iterations: 4
# Estimate and confidence interval
cbind(coef(my_model1), confint(my_model1))
##                               2.5 %      97.5 %
## (Intercept) -0.79406437 -1.87558320  0.26100198
## sexM         1.04800182  0.56283743  1.54857812
## famrel      -0.29791173 -0.55591186 -0.04251048
## absences     0.08940969  0.04695413  0.13651044
## failures     0.57327802  0.17701334  0.98884313

The ‘sexM’ estimate gives the coefficient of category ‘M’, with the base category ‘F’ absorbed into the intercept term. All 4 covariates have p-values smaller than 0.05, so all 4 show a statistically significant relation with high alcohol consumption. Also, all the coefficients are away from zero and the 95% confidence intervals exclude zero for all 4 covariates. To view these effects as odds ratios, let’s exponentiate the coefficients and confidence intervals.

# Odds ratios by exponentiating coefficients of the model
print("Odds ratios - estimate and confidence interval")
## [1] "Odds ratios - estimate and confidence interval"
cbind(exp(coef(my_model1)), exp(confint(my_model1)))
##                           2.5 %    97.5 %
## (Intercept) 0.4520039 0.1532656 1.2982302
## sexM        2.8519467 1.7556470 4.7047758
## famrel      0.7423669 0.5735490 0.9583804
## absences    1.0935286 1.0480739 1.1462668
## failures    1.7740730 1.1936470 2.6881229

Although ‘absences’ has an odds ratio of 1.0935, which is not far from 1, its confidence interval excludes 1. Hence the confidence intervals of all 4 odds ratios exclude 1, indicating a statistical relation with the target variable high_use. Comparing this result with the initial hypotheses: sexM does have an effect on high_use, so that hypothesis is not true. ‘famrel’ has a negative effect on high_use, since its odds ratio and 95% interval lie below 1. ‘absences’ has a positive impact on high_use, and ‘failures’ clearly has a relation with high_use. So for these 3 covariates the initial hypotheses are true.
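A quick arithmetic check (my own, using the coefficient printed earlier): an odds ratio is simply the exponential of the log-odds coefficient.

```r
# odds ratio for sexM from its log-odds coefficient
exp(1.04800182)   # approx 2.85, matching the table above
```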

The AIC of this model is 411.77. AIC is only useful for comparison: given another model’s AIC, the model with the lower AIC is considered the better fit.

Prediction

The next step is prediction using the model my_model1. The predict() function gives a probability for each row in the dataset. Probabilities greater than 0.5 are classified as TRUE (high_use = TRUE or 1) and the others as FALSE (high_use = FALSE or 0).

# predict() the probability of high_use
probabilities <- predict(my_model1, type = "response")

# add the predicted probabilities to the data to get a new dataset 'alc'
alc <- mutate(data3, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

The next table and graph show the prediction accuracy in terms of observed high_use values versus predictions.

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   15
##    TRUE     77   34
print("In terms of proportions and with margins")
## [1] "In terms of proportions and with margins"
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65945946 0.04054054 0.70000000
##    TRUE  0.20810811 0.09189189 0.30000000
##    Sum   0.86756757 0.13243243 1.00000000
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

To compute the prediction error as the average number of wrong predictions, let’s define a loss function as given in DataCamp.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486

The average proportion of wrong predictions with the model my_model1 is about 0.25 (0.2486).

Cross validation

Let’s perform 10-fold cross validation for ‘alc’ data, model my_model1 and the loss function defined above.

# K-fold cross-validation

cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model1, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2540541

With 10-fold cross-validation, the average proportion of wrong predictions is around 0.25 (0.254 in this run; since no random seed is set for the folds, each run of the cv function gives a slightly different value, but all of them are close to this), which is more than the training error of 0.249. This is expected: the training error is computed on the same observations used to fit the model, while the CV error is computed on observations left out of the fitting. The training error reflects only estimation error, without the uncertainty of new, unseen data, so the CV error is typically greater than the training error.
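To make the CV result reproducible, a seed can be set before the cross-validation; repeating the run then shows how much the random fold assignment alone moves the error. A sketch, assuming ‘alc’, ‘loss_func’ and ‘my_model1’ exist as above (the seed value is arbitrary):

```r
set.seed(123)  # arbitrary seed for reproducible fold assignment
cv_runs <- replicate(5, cv.glm(data = alc, cost = loss_func,
                               glmfit = my_model1, K = 10)$delta[1])
cv_runs        # five CV errors, all close to 0.25
```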

Looking at the cv error, one can claim that performance of this model is similar to that of the model given in DataCamp.

Next, let’s start with a large number of predictors and remove one predictor at each step based on deviance: at each step, the covariate whose removal most improves the model is dropped, and the process continues until removing a covariate brings no further improvement. This can be performed using the ‘step’ function on the full model.

# formula using 26 covariates in the data is created.
BiggerFormula <- as.formula(high_use ~sex + age+famsize+Pstatus+Medu+Fedu+Mjob+
  Fjob+guardian+traveltime+studytime+schoolsup+famsup+activities+nursery+higher+        
  internet+romantic+famrel+freetime+goout+health+failures+paid+absences+G3)

# glm is fitted to the bigger formula.
FullModel <- glm(BiggerFormula,family=binomial(link="logit"),data=data3)

# For loop to reduce 1 covariate in every step.
finalDF <- c() # to save final result
numcov = 26 # initial number of covariates

for( mm in 1:26){
  
  prevnumcov= numcov
  
  # perform stepwise backward elimination with steps = mm 
  Newmodel <- step(FullModel,direction="backward",trace=FALSE, steps =mm)
  
  numcov=length(coef(Newmodel))-1
  
  if(numcov == prevnumcov){ 
    print(paste0("No further improvement after ",prevnumcov, " variables." ))
    break
  }
  # predict() the probability of high_use
  probabilities <- predict(Newmodel, type = "response")
  
  # add the predicted probabilities to the data to get a new dataset 'alc1'
  alc1 <- mutate(alc, probability1 = probabilities)
  
  # use the probabilities to make a prediction of high_use
  alc1 <- mutate(alc1, prediction1 = probability1 > 0.5)
 
  trainerror = loss_func(class = alc1$high_use, prob = alc1$probability1)
  
  cv <- cv.glm(data = alc1, cost = loss_func, glmfit = Newmodel, K = 10)
  
  # average number of wrong predictions in the cross validation
  newcv = cv$delta[1]
  finalDF <- rbind(finalDF, c(numcov,trainerror,'Training'),c(numcov,newcv,'Prediction'))
}
## [1] "No further improvement after 9 variables."
finalDF = data.frame(finalDF)
colnames(finalDF) <- c('Number of covariates', 'Error', 'Type')

ggplot(finalDF, aes(x= `Number of covariates`, y=as.numeric(Error), col=Type)) + 
  geom_point() + scale_y_continuous(labels = scales::number_format(accuracy = 0.01))

For models with a large number of covariates, the training error is smaller but the prediction error is large. Models with a well-chosen smaller set of covariates have only slightly higher training error but clearly lower prediction error.


Clustering and classification

First thing is to load required libraries or R packages for this exercise. Also set random seed for repeatability of the results.

# Required libraries
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
set.seed(12345)

Read the data

The ‘Boston’ data from the ‘MASS’ package is read here. The data concerns housing values and the factors affecting them in the suburbs of Boston. It has 506 rows and 14 columns; each row records 14 aspects of a suburb, such as residential and business land use, pupil-teacher ratio, taxes, etc. Let’s explore the data in more detail.

# load the data
data("Boston")

# Structure of the dataset 

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Dimension of the dataset
dim(Boston)
## [1] 506  14

Summaries and distributions of the variables.

# Summary the dataset
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# distribution of variables and their relationships with each other
pairs(Boston)

# Better graphical presentation with ggpairs()
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The distributions of all the variables can be seen from the plot above. ‘rm’, the average number of rooms per dwelling, has a nice bell-shaped curve suggesting a normal distribution with mean 6.285. The variables ‘dis’, ‘lstat’ and ‘medv’ have roughly normal-looking curves with longer right tails. ‘indus’ and ‘tax’ seem bimodal.

To explore the relationships between the variables, let’s use ‘corrplot’.
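The chunk that produced the correlation matrix below is not shown in this knit; a likely version (my own reconstruction, in the style of the other chunks, with the corrplot arguments chosen as an example) is:

```r
# correlation matrix rounded to two decimals, then visualized
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
corrplot(cor_matrix, method = "circle", type = "upper",
         cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
```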

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00

‘medv’ (median value of owner-occupied homes in $1000s) shows correlation with all the variables. Except for ‘chas’ (binary variable ‘tract bounds river’), all variables have fairly strong correlations (positive or negative) with each other.

From the colorful corrplot, it can be seen that the highest positive correlation is between ‘tax’ (full-value property-tax rate per $10,000.) and ‘rad’ (index of accessibility to radial highways).

‘dis’ (weighted mean of distances to five Boston employment centres) has a strong negative correlation with ‘indus’ (proportion of non-retail business acres per town), ‘age’ (proportion of owner-occupied units built prior to 1940) and ‘nox’ (nitrogen oxides concentration, parts per 10 million), indicating that near the employment centres there is a higher proportion of non-retail business, a higher proportion of owner-occupied units built prior to 1940 and a higher nitrogen oxides concentration. As expected, ‘lstat’ (lower status of the population, percent) and ‘medv’ (median value of owner-occupied homes in $1000s) are also highly negatively correlated.

‘chas’ (binary variable ‘tract bounds river’) has hardly any correlation with the other variables. Although people historically settled near rivers and other water sources, proximity to the river seems to have little bearing on these housing-related variables today.
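The correlation matrix and the corrplot discussed above can be reproduced with a short snippet (a sketch, assuming the MASS and corrplot packages are installed):

```r
library(MASS)      # provides the Boston dataset
library(corrplot)  # provides the corrplot() visualization

# correlation matrix rounded to two decimals, as printed above
cor_matrix <- round(cor(Boston), digits = 2)

# circle size and color encode the strength and sign of each correlation
corrplot(cor_matrix, method = "circle", type = "upper",
         cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
```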

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling, every variable has mean 0 and a much narrower range: most of the 1st and 3rd quartiles now lie between (or near) -1 and 1. The next step is to create a categorical crime variable, with each category indicating a level of crime. The original variable ‘crim’ is then removed and the categorical variable added as a new column.
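scale() standardizes each column by subtracting its mean and dividing by its standard deviation; a quick sketch confirming this on the ‘crim’ column:

```r
library(MASS)
data("Boston")

# scale() computes (x - mean(x)) / sd(x) column by column
boston_scaled <- scale(Boston)

# verify manually for the 'crim' column
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.vector(boston_scaled[, "crim"]), manual_crim)  # should be TRUE
```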

# create data frame object
boston_scaled <- as.data.frame(boston_scaled)

# create a categorical variable 'crime' with cut points as quantiles of the variable 'crim'
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Let’s create the training data by randomly selecting 80% of the rows of the data. The remaining rows will be treated as test data.

# choose randomly 80% of the rows for training set
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
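The split above is random, so the train/test composition changes on every run; fixing the random seed first makes it reproducible. A minimal sketch of the same split with a fixed (arbitrary) seed, omitting the crime factor for brevity:

```r
library(MASS)

# scaled data as above (the crime factor is omitted here for brevity)
boston_scaled <- as.data.frame(scale(Boston))

# fixing the seed makes the random 80/20 split reproducible
set.seed(123)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

c(train = nrow(train), test = nrow(test))  # 404 and 102 of the 506 rows
```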

Linear discriminant analysis (LDA)

Linear discriminant analysis is performed on the training data with ‘crime’ as the target variable and all the other variables as predictors. After the LDA fit is obtained, the biplot is drawn.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2648515 0.2500000 0.2376238 0.2475248 
## 
## Group means:
##                  zn      indus          chas        nox          rm        age
## low       1.0411093 -0.9021842 -0.1251477505 -0.8847044  0.40705116 -0.8865465
## med_low  -0.0833036 -0.2819395  0.0005392655 -0.5730096 -0.10985904 -0.3410669
## med_high -0.3871935  0.2004107  0.1787970010  0.4091306  0.04385536  0.4041791
## high     -0.4872402  1.0171519 -0.1148450583  1.0581412 -0.39225400  0.8234116
##                 dis        rad        tax     ptratio       black      lstat
## low       0.9165942 -0.6974378 -0.7426399 -0.41762332  0.38119357 -0.7516720
## med_low   0.3466853 -0.5406779 -0.4979925 -0.02930992  0.32091020 -0.1609564
## med_high -0.3803789 -0.4925764 -0.3679189 -0.34561698  0.07515175  0.0204888
## high     -0.8428566  1.6377820  1.5138081  0.78037363 -0.76621760  0.9513763
##                medv
## low       0.4874449
## med_low   0.0085978
## med_high  0.1667764
## high     -0.7788111
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.09932684  0.660082422 -0.924446780
## indus    0.20684196 -0.173509150  0.355125653
## chas    -0.03474956 -0.074291398  0.125240743
## nox      0.26876697 -0.748634714 -1.377750333
## rm       0.07723357  0.012536636 -0.123701216
## age      0.08442120 -0.325512288 -0.120973431
## dis     -0.08550872 -0.218908959  0.157590088
## rad      4.49240099  0.992157796  0.006802416
## tax      0.15738914 -0.102939131  0.427837042
## ptratio  0.10718133 -0.003208959 -0.226272327
## black   -0.02962752  0.097919438  0.165267471
## lstat    0.21600800 -0.222397929  0.317757617
## medv     0.03067033 -0.494810214 -0.193725250
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9702 0.0230 0.0068
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

‘rad’ has the longest arrow, so it is the most influential discriminant among all the predictors: ‘rad’ separates the crime categories best.
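The visual impression can be checked numerically: the arrow lengths are driven by the discriminant coefficients, so ordering the absolute LD1 coefficients shows which predictor separates the classes most along the first discriminant. A self-contained sketch on the full (unsplit) data, which should agree with the biplot:

```r
library(MASS)

# rebuild the scaled data with the categorical crime variable, as above
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

fit <- lda(crime ~ ., data = boston_scaled)

# rank predictors by the size of their LD1 coefficient;
# 'rad' should come out on top, matching the longest arrow in the biplot
sort(abs(coef(fit)[, "LD1"]), decreasing = TRUE)
```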

LDA prediction

We would like to predict the ‘crime’ categories for the test data, so we keep a copy of the true categories in ‘correct_classes’ and then delete the ‘crime’ column from the test data.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       7        0    0
##   med_low    5      16        4    0
##   med_high   0       9       16    5
##   high       0       0        0   27

The cross tabulation shows that the category ‘high’ is predicted very well in the test data (all 27 correct). The category ‘med_high’ is predicted worst: out of the 30 ‘med_high’ rows in the test data, only 16 are predicted correctly. ‘low’ and ‘med_low’ are predicted correctly about two thirds of the time.
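The table can also be summarized by a single accuracy figure: the share of correctly predicted rows is the sum of the diagonal divided by the total. A minimal sketch using the counts printed above:

```r
# confusion matrix copied from the cross tabulation above
conf <- matrix(c(13,  7,  0,  0,
                  5, 16,  4,  0,
                  0,  9, 16,  5,
                  0,  0,  0, 27),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correct predictions / all predictions
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 72 / 102, about 0.71
```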

K-means clustering

Let’s consider the Boston data again and compute distance matrices, such as the Euclidean and Manhattan distances, on the scaled Boston data.

data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)


# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of euclidean distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')

# look at the summary of manhattan distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Even on the scaled data, the two distance measures have very different ranges. Let’s perform k-means clustering with 3 clusters.

# k-means clustering
km3 <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km3$cluster)

Overall, the plots show groups of 3 colors, but not all 3 colors appear in every panel, and the groups do not always form well-separated clusters.

To find an appropriate number of clusters, let’s try every value of k from 1 to 10.

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot shows a large improvement from 1 cluster to 2, with little further improvement after that. If a very small total within-cluster sum of squares is required, more clusters could be used; however, many clusters are not useful if they do not actually differ across the variables. Let’s go ahead with k = 2, which gives a clear improvement over k = 1.
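The size of that improvement can be quantified by looking at how much of the k = 1 total within-cluster sum of squares remains at each k (a sketch; the seed is arbitrary and the exact values vary between k-means runs):

```r
library(MASS)
boston_scaled <- scale(Boston)

# reproducible k-means runs for k = 1..10
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)

# proportion of the k = 1 total SS remaining at each k;
# the drop from k = 1 to k = 2 is by far the largest
round(twcss / twcss[1], 2)
```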

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Almost all the panels show a good separation of the 2 groups by color, so k = 2 was a good choice.

LDA on k-means clusters

Let’s use k = 4 to generate clusters, and then perform LDA to see how well these 4 clusters can be separated by linear discriminant analysis.

# k-means clustering
km4 <- kmeans(boston_scaled, centers = 4)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km4$cluster)

# dplyr is needed for the pipe and mutate()
library(dplyr)

# add the cluster assignment as a new column
boston_scaled1 <- data.frame(boston_scaled)
boston_scaled1 <- boston_scaled1 %>%
  mutate(cluster = km4$cluster)

# linear discriminant analysis
lda.fit4 <- lda(cluster ~ ., data = boston_scaled1)

# print the lda.fit object
lda.fit4
## Call:
## lda(cluster ~ ., data = boston_scaled1)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2727273 0.4051383 0.1185771 0.2035573 
## 
## Group means:
##         crim         zn      indus        chas        nox         rm        age
## 1  0.9756821 -0.4872402  1.0939296 -0.21526964  0.9652909 -0.4642740  0.7722565
## 2 -0.3888773 -0.3531707 -0.3412963 -0.27232907 -0.4124824 -0.1646322 -0.1241405
## 3 -0.2131078 -0.3300240  0.4860617  1.49936604  0.9890166  0.1347329  0.7475175
## 4 -0.4091050  1.5479668 -1.0695170 -0.04298342 -1.0484685  0.8712179 -1.2230451
##          dis        rad        tax    ptratio       black       lstat
## 1 -0.8278800  1.4598695  1.4701648  0.8275348 -0.71273723  0.93140384
## 2  0.1406541 -0.5964345 -0.6080622  0.1676730  0.31637467 -0.15551283
## 3 -0.7281887 -0.2889628 -0.1777283 -1.2881927 -0.04951573  0.01817238
## 4  1.2534434 -0.6005355 -0.6559834 -0.6920507  0.35409586 -0.94897033
##          medv
## 1 -0.78074928
## 2 -0.08063264
## 3  0.47647538
## 4  0.92897641
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## crim     0.012656810  0.02428702 -0.14961045
## zn       0.095335547  0.40053401 -1.13758796
## indus   -0.504615517 -0.19850861 -0.24468003
## chas     0.157622073 -0.86698142 -0.33492529
## nox      0.225995414 -0.89706014 -0.52034056
## rm       0.021371579  0.30917185 -0.33017316
## age     -0.167677382 -0.43903833  0.37200499
## dis      0.362302731 -0.13797266 -0.43019018
## rad     -1.477515958  0.41995946 -0.16247733
## tax     -0.832279137  0.18854689 -0.61346017
## ptratio  0.007751927  0.89661450  0.30890506
## black    0.005638473  0.08388227  0.07581497
## lstat   -0.199096514  0.28003158 -0.34738897
## medv     0.217842616 -0.11520682 -0.55811203
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.6775 0.1791 0.1435
# target classes as numeric
classes <- as.numeric(boston_scaled1$cluster)

# plot the lda results
plot(lda.fit4, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit4, myscale = 2)

Variables with longer arrows in the plot are the strongest linear discriminants for these 4 groups.

Visualization using plotly

# the plotly library is needed for the interactive 3D scatter plot
library(plotly)

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)